Unsupervised outlier detection applied to SARS-CoV-2 nucleotide sequences can identify sequences of common variants and other variants of interest

As of June 2022, the GISAID database contains more than 11 million SARS-CoV-2 genomes, including several thousand nucleotide sequences for the most common variants such as delta or omicron. These SARS-CoV-2 strains have been collected from patients around the world since the beginning of the pandemic. We start by assessing the similarity of all pairs of nucleotide sequences using the Jaccard index and principal component analysis. As shown previously in the literature, an unsupervised cluster analysis applied to the SARS-CoV-2 genomes results in clusters of sequences according to certain characteristics such as their strain or their clade. Importantly, we observe that nucleotide sequences of common variants are often outliers in clusters of sequences stemming from variants identified earlier on during the pandemic. Motivated by this finding, we are interested in applying outlier detection to nucleotide sequences. We demonstrate that nucleotide sequences of common variants (such as alpha, delta, or omicron) can be identified solely based on a statistical outlier criterion. We argue that outlier detection might be a useful surveillance tool to identify emerging variants in real time as the pandemic progresses. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-022-05105-y.

The emergence of new variants of the SARS-CoV-2 virus poses a threat to the progress made by ongoing vaccination campaigns against COVID-19. Therefore, the detection and possible identification of newly emerging variants of the SARS-CoV-2 virus in (close to) real time is of great interest.
Currently, a tool called "genomic surveillance" is used by the Centers for Disease Control (CDC) to detect new variants [4]. This is done both through the National SARS-CoV-2 Strain Surveillance (NS3) program, as well as through commercial and academic laboratories contracted by the CDC, where genetic information of SARS-CoV-2 specimen are analyzed and classified into variants. By definition, a variant is characterized by having one or more mutations which differentiate it from other variants of the SARS-CoV-2 virus [5]. A group of variants with similar genetic changes (a lineage) can be classified as a variant of concern (VOC) or a variant of interest (VOI) if they share characteristics that potentially necessitate public health action. For example, the U.S. government SARS-CoV-2 Interagency Group (SIG) classified omicron as a Variant of Concern (VOC) on 30 November 2021 due to the fact that omicron emerged in multiple countries without apparent travel history, the replacement of certain delta variants as predominant variants in South Africa by omicron, and its number of mutations in the spike protein which indicated a reduced susceptibility to sera from vaccinated individuals and certain monoclonal antibody treatments. The purpose of this article is to explore the ability of new unsupervised learning methodology to detect emerging variants of interest.
As shown previously in the literature [6,7], an unsupervised cluster analysis in which the similarity of all pairs of nucleotide sequences is assessed using the Jaccard index, and subsequent application of principal component analysis to the Jaccard similarity matrix, results in clusters of sequences according to certain characteristics such as their strain or their clade. Importantly, in [8] the authors notice that nucleotide sequences the omicron variant cluster among sequences stemming from variants identified earlier on during the pandemic. Due to the fact that the aforementioned unsupervised approaches successfully clustered nucleotide sequences by strain or clade, and revealed features of the omicron variant, we likewise focus on an unsupervised approach based on the Jaccard similarity matrix in connection with principal component analysis in this work.
This finding immediately prompts the question whether the nucleotide sequences belonging to common variants can be identified by unsupervised outlier detection. In this article, we investigate this question by applying outlier detection to nucleotide sequences, both before the emergence of a variant and after a variant has emerged. We demonstrate that indeed, the number of detected outliers often increases shortly after the emergence of a new variant, and that nucleotide sequences of common variants can be identified solely based on a statistical outlier criterion.
Our findings could have important implications for the automated, unsupervised identifications of SARS-CoV-2 strains. We argue that outlier detection might be a useful surveillance tool to identify emerging variants of interest in real time as the pandemic progresses. This is also important for vaccination strategies, to identify emerging variants that may be resistant to available vaccines [9].
The article is structured as follows. The "Methods" section introduces the methodology we use for this article, starting with data acquisition and cleaning, and how the similarity of sequences is assessed. We then describe the outlier detection method we use. The "Results" section presents our findings on the clustering and outlier detection of SARS-CoV-2 nucleotide sequences. The article concludes with a "Discussion" section.

Methods
In this section, we highlight methodological features of the analysis. In particular, we describe data acquisition and cleaning ("Data acquisition and cleaning" section), the assessment of the similarity of nucleotide sequences ("Assessing the similarity of nucleotide sequences" section), the methods used for outlier detection among sequences ("Outlier detection" section), and the calibration of the outlier detection ("Calibration" section).

Data acquisition and cleaning
All findings reported in this article are based on an image of all available SARS-CoV-2 nucleotide sequences in the GISAID database [1,2] until 28 March 2022, consisting of 211,167 sequences having accession numbers in the range of EPI_ISL_403962-EPI_ISL_11498019. By timestamp we always refer to the collection date on GISAID. Sequences are only included in the analysis if they satisfy the four data quality attributes on GISAID. To be precise, all nucleotide sequences have to satisfy the criterion of being complete (defined as sequences having length at least 29,000 bp), high coverage (defined as sequences with less than 1% N-bases), with patient status (defined as sequences with meta information consisting of age, sex, and patient status), and collection date complete (defined as sequences with a complete year-month-day collection date) (Additional file 1).
We aim to investigate if it is possible to detect sequences of a new variant among the other sequences in circulation upon emergence of that new variant. We consider eight common SARS-CoV-2 variants available on GISAID. Those are alpha (B.1.1.7), beta   (Table 1).
To detect a new variant, we generate two reference datasets for each variant. For the first dataset, we determine the timepoint T 1 at which the first sequences of each variant under consideration emerge on GISAID. We then generate the first reference dataset using only sequences from GISAID with a timestamp before T 1 . The second dataset emulates the emergence of a new variant. For this we determine the timepoint T 2 at which 10% of all the sequences of a variant under consideration are available on GISAID (the threshold of 10% is arbitrary). We then generate the second reference dataset using only sequences from GISAID with a timestamp up to T 2 . The details of the reference dataset up to T 1 are given in Table 2, the sequences we aim to detect for each variant are given in Table 3, and the combined dataset simulating the emergence of each variant up to timepoint T 2 is given in Table 4. As before, the timestamps T 1 and T 2 mentioned in the article and in Tables 2, 3 and 4 refer to the collection date on GISAID.
Our planned subsequent computations on the nucleotide sequences (the calculation of the principal components of the Jaccard similarity matrix) are too computationally intensive to be carried out for all available sequences on GISAID. For this  reason, we down-sample each dataset by drawing an unbiased sample of size 10,000 without replacement.
Using the alignment tool MAFFT [10] and the official SARS-CoV-2 reference sequence (available on GISAID under the accession number EPI_ISL_402124), we align all n sequences to the reference genome. We employed MAFFT with the keeplength option in order to obtain a well-defined window (of length L = 29,891 base pairs) for comparison of all sequences. All other parameters of MAFFT were kept at their default values.

Assessing the similarity of nucleotide sequences
We next convert all sequences into a binary Hamming matrix X ∈ B n x L (where B = {0,1} is the set of binary numbers) as follows. We compare the reference genome to each aligned nucleotide sequence, and set X ij = 1 if the sequence with number i differs at position j from the reference sequence. Otherwise, we set X ij = 0. Here, the number of rows of X is set to the number of nucleotide sequences, and L = 29,891 is the number of base pairs in the comparison window. The row sums of X correspond to the Hamming distance of each nucleotide sequence to the reference genome. This methodology has already been used in the literature [6][7][8]11].
We employ the Jaccard similarity measure [12][13][14] to assess the similarity of all pairs of sequences. To be precise, each entry (i,j) of the Jaccard matrix J(X) ∈ R n x n (having n rows and n columns) is a measure of similarity between the binary rows i and j of X. An entry (i,j) of J(X) of zero encodes that the two genomes do not share any deviations from the reference genome, while an entry of one encodes equality of rows i and j of X. We employ the R-package "locStra", available on CRAN [15,16], to compute the Jaccard matrix.
For all figures included in this work, we visualize the Jaccard similarity measures by computing its first two principal components. We plot the first principal component against the second principal component, thus effectively interpreting the entries of the first eigenvector as x-coordinates, and the ones of the second eigenvector as Table 4 Combined dataset consisting of both Tables 2 and 3, subsampled again to size 10,000 Range of accession numbers extracted from the GISAID database, their time stamps, and the total number of sequences included y-coordinates. We color each point according to either a time stamp, according to its cluster membership, or according to whether it is an outlier.

Outlier detection
We are interested in detecting sequences falling into neighborhoods or clusters in which they are classified as outliers (subject to a certain criterion). To be precise, we are interested in sequences falling into neighborhoods consisting of sequences having much older (or newer) time stamps. We aim to utilize an approach which is not dependent on previously identified clusters. One way to achieve this is to define a local environment of radius eps > 0 around each sequence in a principal component plot (each sequence corresponds to a point in the principal component plot), and to consider all other (that is, similar) sequences falling into that local environment. Comparing the time stamp of the sequence under consideration to the distribution of timestamp in the local environment allows one to define an outlier. We say that a sequence is an outlier in its local environment if its time stamp is more than f > 0 standard deviations from the mean date in the environment.

Calibration
Our clustering approach depends on two tuning parameters, the radius of the local environment eps, and the factor f that specifies how many standard deviations away from the mean date are needed to define a sequence as an outlier. To calibrate both parameters, we look at the number of outliers which are identified in the data as a function of both eps and f. This results in a typical "elbow" plot, though here in two dimensions (see Fig. 2). For small values of f, meaning values close to the mean, many outliers are flagged. As f increases, fewer and fewer outliers are identified. The decrease is usually not linear.  Table 4). First two principal components of the Jaccard matrix, colored by the collection time stamp of each nucleotide sequence. The color scale encodes early (green) to late (red) sequences according to the color scheme shown on the right. Sequences of the omicron variant (see Table 3) are highlighted as triangles Instead, the number of outliers usually drops rapidly at a certain cutoff f before leveling off, thus giving the plot its name. The point at which the plot levels off can be used to determine f. We apply the elbow method to both set the parameter f, as well as the parameter eps.   Table 4). First two principal components of the Jaccard matrix with subsequent local outlier detection approach. Parameters eps = 1e-2 (the neighborhood radius) and f = 1.5 (the multiplier for the standard deviations). Outliers depicted as yellow triangles

Results
We first focus on the newest variant, omicron. Figure 1 shows a plot of the first two principal components of the Jaccard matrix as outlined in section "Assessing the similarity of nucleotide sequences". As observed previously [8] the genomes from GISAID exhibit a particular progression pattern, with older sequences (green) clustering in the middle of the plot, while newer samples (red) cluster at the bottom of the plot. The progression of genomes seems to take place from the early point cloud (green, middle), to genomes with intermediate timestamps (top), to new samples (red, bottom). As also observed in the aforementioned publication, genomes of the omicron strain are most similar to genomes in stemming from early on in the pandemic. This is visible from Fig. 1 as omicron samples (triangles) fall into a point cloud of early (green) genomes.
Interestingly, the observations for Fig. 1 are virtually identical with the ones made in [8], even though both experiments are made with independent, and thus entirely  Table 2) and after (top right, see Table 4) the emergence of the alpha variant, where sequences of the alpha variant (see Table 3) are highlighted as triangles. Color scheme as in Fig. 1. Local outlier detection applied before (bottom left) and after (bottom right) the emergence of the alpha variant, with outliers depicted as yellow triangles different, subsamples without replacement of size 10,000 taken from all complete sequences available on GISAID.
Before applying the approach of "Outlier detection" section, we calibrate the outlier detection on the omicron data as outlined in section "Calibration". Figure 2 shows the two dimensional elbow plot of the number of flagged outliers as a function of both the radius of the local environment eps and the parameter f. We indeed observe a distinct shape of the decrease in the number of outliers as the parameter f increases, with a sharp decrease at around f = 1.2, after which the plot levels off. Interestingly, the algorithm is rather insensitive to the choice of the local environment eps, apart from the case eps = 0. We repeated the calibration for the other variants as well. Interestingly, the parameters f = 1.2 and eps = 1e-1 emerge as consistent choices for all variants. Therefore, we use f = 1.2 and eps = 1e-1 in the remainder of the section.
After calibration, we aim to identify outliers using the local detection approach of "Outlier detection" section. Figure 3 shows the same principal components as Fig. 1 for the omicron variant, though this time without any coloring by timestamp. Instead, all points in yellow have the property that they pass the local outlier criterion of "Outlier  Table 2) and after (top right, see Table 4) the emergence of the beta variant, where sequences of the beta variant (see Table 3) are highlighted as triangles. Color scheme as in Fig. 1. Local outlier detection applied before (bottom left) and after (bottom right) the emergence of the beta variant, with outliers depicted as yellow triangles detection" section, meaning that they are outliers in a local epsilon environment centered around them, subject to the calibration of "Calibration" section.
Interestingly, using the same calibration, a number of other sequences not belonging to the omicron strain are flagged in Fig. 3. These belong to the delta variant of the SARS-CoV-2 virus. In what way these samples differ from the other delta variant samples in Fig. 3 remains an important question of future work.
Next, we investigate the behavior of the outlier detection upon the emergence of a new variant. We are especially interested if an increase in outliers can be detected upon the emergence of a new variant. To this end, for each variant under investigation (alpha, beta, delta, gamma, GH, lambda, mu, omicron), we apply the same calibrated outlier detected to first the reference dataset before the emergence of each variant, and after the emergence of each variant. Figures 4, 5, 6, 7, 8, 9, 10 and 11 show results for all eight variants (alpha, beta, delta, gamma, GH, lambda, mu, omicron). The left column always corresponds to the time period before the emergence of each variant, and the right column corresponds to the time period after the emergence of each variant. The top plots show the first two principal components with highlighted sequences for each variant  Table 2) and after (top right, see Table 4) the emergence of the delta variant, where sequences of the delta variant (see Table 3) are highlighted as triangles. Color scheme as in Fig. 1. Local outlier detection applied before (bottom left) and after (bottom right) the emergence of the delta variant, with outliers depicted as yellow triangles under consideration, the bottom plots show the local outliers as yellow triangles. We observe that for the beta, delta, GH, and omicron variants the number of detected outliers considerably increases after the emergence of the variant. For the other variants, the change in the number of outliers is less pronounced. For the gamma variant, the number of detected outliers considerably decreases after the emergence of the variant.
To concretize results, Table 1 summarizes the total number of detected outliers, the number of detected genomes per variant, and the number of genomes for each variant that is included in the dataset (and that can possibly be detected). We observe that for the common variants beta, delta, GH, and omicron, the detection of the emergence of a new strain is possible. Clearly the biological importance of a new variant cannot be assessed via outlier detection, but the proposed method would have been able to flag these strains as variants of interest.
Interestingly, Table 1 shows that the number of outliers before emergence of a variant varies widely among variants. This is due to the fact that the reference datasets are  Table 2) and after (top right, see Table 4) the emergence of the gamma variant, where sequences of the gamma variant (see Table 3) are highlighted as triangles. Color scheme as in Fig. 1. Local outlier detection applied before (bottom left) and after (bottom right) the emergence of the gamma variant, with outliers depicted as yellow triangles independently subsampled from GISAID in order to match the timepoint T 1 at which each variant occurs first. With our results we aim to demonstrate that a surge in outliers can happen upon emergence of a variant, meaning that the (relative) difference in the number of outliers is of interest and could be indicative of a change in the dynamics of the pandemic.
It is noteworthy to point out that in the case of Fig. 10, the plot of the first two principal components changes before and after the emergence of a variant. This is attributed to how eigenvectors (principal components) change when perturbing a matrix (for instance, [17] provides a bound on the angle of the perturbed eigenvector). Therefore, adding more data from GISAID to the computation of the Hamming matrix and the subsequent computation of the Jaccard matrix can change the Jaccard matrix and its eigenvectors.
Finally, we also consider a control case in which no new variant occurs. Figure 12 shows an example of this scenario using the alpha variant. To prepare Fig. 12, we  Table 2) and after (top right, see Table 4) the emergence of the GH variant, where sequences of the GH variant (see Table 3) are highlighted as triangles. Color scheme as in Fig. 1. Local outlier detection applied before (bottom left) and after (bottom right) the emergence of the GH variant, with outliers depicted as yellow triangles divided the reference dataset for the alpha variant (see Table 2) into two parts. The first contains the first 5000 sequences in sorted order of their timestamps, while the second part contains the later 5000 sequences. As before, we observe a certain number of outliers in the first dataset (Fig. 12, bottom left). In contrast to the other figures, sequences highlighted in Fig. 12 (bottom left) are not highlighted again in Fig. 12 (bottom right), confirming in this example that a surge would not be detected at this point in time.

Discussion
In this work, we demonstrate that nucleotide sequences of common virus strains/variants can be identified solely based on a statistical outlier criterion in real time. To this end, we prepare two reference datasets, one before and one after the emergence of eight  Table 2) and after (top right, see Table 4) the emergence of the lambda variant, where sequences of the lambda variant (see Table 3) are highlighted as triangles. Color scheme as in Fig. 1. Local outlier detection applied before (bottom left) and after (bottom right) the emergence of the lambda variant, with outliers depicted as yellow triangles common SARS-CoV-2 variants (alpha, beta, delta, gamma, GH, lambda, mu, omicron) available on the GISAID database, and apply an outlier detection method to those datasets.
Using the proposed local outlier detection approach, we can identify genomes belonging to the beta, delta, GH, and omicron strain upon emergence of these variants. However, this detection comes at the cost of a larger number of false positives. The nature of those other nucleotide sequences that pass our outlier criteria, and in what way they differ from other sequences of the most common SARS-CoV-2 variants, is an important direction of ongoing research.
The large number of false positives we observe when applying outlier detection to nucleotide sequences can pose a problem for the task of accurately highlighting newly emerging sequences. The primary aim of this proposed methodology is for use as an online screening tool, or warning system, to detect the emergence of a new variant through an increase in outliers. Additional work would be required to confirm which outliers are newly emerging variants of concern.  Table 2) and after (top right, see Table 4) the emergence of the mu variant, where sequences of the mu variant (see Table 3) are highlighted as triangles. Color scheme as in Fig. 1. Local outlier detection applied before (bottom left) and after (bottom right) the emergence of the mu variant, with outliers depicted as yellow triangles In our study we aim to demonstrate the usefulness of the proposed methodology for prediction. However, not all mathematical models are useful prediction tools. Various prediction models have been proposed since the start of the pandemic, with various success. For instance, some models forecasted that SARS-CoV-2 would not develop any variants with distinct pathologies [18], while others concluded based on hidden Markov models that certain variants with deleterious mutations go extinct [19]. A comprehensive and retrospect assessment of the accuracy of (non-pharmacological intervention) models for the case of Sweden can be found in [20], where the authors conclude that some models significantly overestimated the virus spread.
Importantly, this research shows that outlier detection might be a useful tool to identify emerging variants in real time as the pandemic progresses, using machine learning techniques and purely statistical methods only.  Table 2) and after (top right, see Table 4) the emergence of the omicron variant, where sequences of the omicron variant (see Table 3) are highlighted as triangles. Color scheme as in Fig. 1. Local outlier detection applied before (bottom left) and after (bottom right) the emergence of the omicron variant, with outliers depicted as yellow triangles